Improving Distributional Similarity with Lessons Learned from Word Embeddings

Abstract

A recent study by Baroni et al. (2014) shows that new embedding methods consistently outperform traditional count-based methods by a non-trivial margin on many similarity-oriented tasks, while the analysis by Levy and Goldberg shows that word2vec's SGNS is implicitly factorizing a word-context PMI matrix. This paper reveals that much of the performance gain of word embeddings is due to certain system design choices and hyperparameter settings, rather than the embedding algorithms themselves. Furthermore, it shows that these modifications can be transferred to traditional distributional models, yielding similar gains.

Background

Four word representation methods are considered:

  1. the explicit PPMI matrix
  2. SVD factorization of said matrix
  3. SGNS
  4. GloVe

PPMI Matrix


$PMI(w,c)=\log \frac{\hat{P}(w,c)}{\hat{P}(w)\,\hat{P}(c)}=\log\frac{\#(w,c)\cdot |D|}{\#(w)\cdot \#(c)}$

$PPMI(w,c)=\max(PMI(w,c),0)$

A well-known shortcoming of PMI, which persists in PPMI, is its bias towards infrequent events.
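The PPMI definition above can be sketched directly from a co-occurrence count matrix. This is a minimal illustration, not the paper's implementation; the dense matrix and the function name `ppmi_matrix` are assumptions for the example (real vocabularies would need sparse matrices).

```python
import numpy as np

def ppmi_matrix(counts):
    """PPMI from a dense word-context co-occurrence count matrix.

    counts[w, c] = #(w, c); rows are words, columns are contexts.
    """
    total = counts.sum()                                # |D|
    word_counts = counts.sum(axis=1, keepdims=True)     # #(w)
    context_counts = counts.sum(axis=0, keepdims=True)  # #(c)
    # PMI(w,c) = log( #(w,c) * |D| / (#(w) * #(c)) )
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (word_counts * context_counts))
    # PPMI clips negatives (and the -inf of unobserved pairs) to 0
    return np.maximum(pmi, 0.0)

counts = np.array([[10.0, 0.0], [2.0, 8.0]])
M = ppmi_matrix(counts)
```

The clipping step is exactly where the rare-event bias enters: a pair seen once with two rare words can get a very large PMI.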

Transferable Hyperparameters

The paper adapts the hyperparameters of the embedding methods and applies them to the count-based methods. They fall into three groups:

  1. pre-processing hyperparameters
  2. association metric hyperparameters
  3. post-processing hyperparameters

Pre-processing hyperparameters

Dynamic Context Windows (dyn)

Context words can be weighted according to their distance from the focus word: word2vec weights a context word at distance $d$ in a window of size $L$ by $\frac{L-d+1}{L}$, while GloVe uses harmonic weighting, $\frac{1}{d}$.
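The two weighting schemes can be sketched as simple functions of the distance (the function names are illustrative, not from the paper):

```python
def dyn_weight_word2vec(distance, window):
    # word2vec: weight decays linearly with distance from the focus word
    return (window - distance + 1) / window

def dyn_weight_glove(distance):
    # GloVe: harmonic weighting, 1/distance
    return 1.0 / distance
```

For a window of 5, word2vec assigns weights 5/5, 4/5, ..., 1/5 to distances 1 through 5.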

Subsampling

Subsampling is a method of diluting very frequent words, akin to removing stop-words.
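word2vec's subsampling discards a token of corpus frequency $f$ with probability $1-\sqrt{t/f}$, where $t$ is a threshold (e.g. $10^{-5}$). A minimal sketch, assuming a hypothetical `keep_token` helper with an injectable random source for testability:

```python
import math
import random

def keep_token(freq, threshold=1e-5, rng=random.random):
    """word2vec-style subsampling: discard a token of corpus frequency
    `freq` with probability 1 - sqrt(threshold / freq)."""
    if freq <= threshold:
        return True  # sufficiently rare words are always kept
    p_discard = 1.0 - math.sqrt(threshold / freq)
    return rng() >= p_discard
```

A stop-word-like token with frequency 0.05 is discarded about 98.6% of the time, while rare words pass through untouched.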

Deleting Rare Words (del)

Delete rare words before creating context windows.

Association Metric Hyperparameters

  1. Shifted PMI (neg)
  2. Context distribution smoothing (cds)
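Both association-metric hyperparameters modify how PMI is computed: the shift subtracts $\log k$ (mimicking SGNS's $k$ negative samples, $SPPMI(w,c)=\max(PMI(w,c)-\log k,\,0)$), and context distribution smoothing raises context counts to the power 0.75 before normalizing. A sketch combining the two (the function name is an assumption for the example):

```python
import numpy as np

def shifted_smoothed_ppmi(counts, neg=5, cds=0.75):
    """PPMI with context distribution smoothing and a log(neg) shift.

    Smoothing (cds < 1) inflates the probability of rare contexts,
    dampening PMI's bias toward them; the shift mimics SGNS trained
    with `neg` negative samples.
    """
    total = counts.sum()
    word_p = counts.sum(axis=1, keepdims=True) / total
    ctx_smoothed = counts.sum(axis=0) ** cds        # #(c)^cds
    ctx_p = (ctx_smoothed / ctx_smoothed.sum())[None, :]
    joint = counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (word_p * ctx_p))
    return np.maximum(pmi - np.log(neg), 0.0)
```

With neg=1 and cds=1 this reduces to plain PPMI, so both knobs can be tuned independently on top of the count-based model.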

Post-processing hyperparameters

  1. Adding context vector
  2. Eigenvalue Weighting
  3. Vector Normalization: the standard L2 normalization of $W$’s rows is consistently superior.
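Two of the post-processing steps compose naturally: add the context vectors ($w+c$), then L2-normalize the rows so that cosine similarity reduces to a dot product. A minimal sketch, assuming word and context matrices `W` and `C` of the same shape (the helper name is illustrative):

```python
import numpy as np

def postprocess(W, C, add_context=True):
    """Optionally form w+c vectors, then L2-normalize each row."""
    V = W + C if add_context else W
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    # guard against all-zero rows before dividing
    return V / np.where(norms == 0.0, 1.0, norms)
```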

Experiments

Word Similarity

Six datasets:

  1. WordSim-353: divided into two datasets: WordSim Similarity and WordSim Relatedness
  2. MEN dataset
  3. Mechanical Turk dataset
  4. Rare Words dataset
  5. SimLex-999 dataset

Analogy

  1. MSR’s analogy dataset
  2. Google’s analogy dataset

Results

At times, changing hyperparameters can bring a bigger improvement than switching to a different representation method. In some tasks, careful hyperparameter tuning can also outweigh the benefit of adding more data.

SVD is very useful. word2vec outperforms GloVe.

The prediction-based word embeddings are not superior to count-based approaches. The contradictory results in Baroni et al. (2014) stem from creating word2vec embeddings with somewhat pre-tuned hyperparameters (the defaults recommended by word2vec) and comparing them to "vanilla" PPMI and SVD representations.

3CosMul dominates 3CosAdd in every case.

A few works show that CBOW has a slight advantage over the other methods, but the original word2vec paper reports that SGNS performs better.

Hyperparameter Analysis

Harmful Configurations

  1. SVD does not benefit from shifted PPMI
  2. Using SVD “correctly” is bad
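The "correct" SVD factorization sets $W=U_d \Sigma_d$ (eig=1); the paper finds that down-weighting the singular values, eig=0.5 or eig=0, works better for similarity tasks. A sketch of the eig hyperparameter (the helper name is hypothetical):

```python
import numpy as np

def svd_embeddings(M, d=2, eig=0.5):
    """Truncated-SVD word vectors from a PPMI-style matrix.

    eig=1 gives the "correct" factorization W = U_d * Sigma_d, which
    the paper finds harmful; eig=0.5 (symmetric) or eig=0 tends to
    perform better on similarity tasks.
    """
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * (S[:d] ** eig)
```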

Beneficial Configurations

PPMI and SVD prefer shorter context windows (win=2), while SGNS always prefers numerous negative samples (neg>1). The only hyperparameter that can be "blindly" applied in any situation is context distribution smoothing (cds=0.75).

Practical Recommendations

  1. Always use context distribution smoothing to modify PMI.
  2. Do not use SVD "correctly" (eig=1); prefer eig=0.5 or eig=0.
  3. SGNS is a robust baseline. While it might not be the best method for every task, it does not significantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory consumption.
  4. With SGNS, prefer many negative samples.
  5. For both SGNS and GloVe, it is worthwhile to experiment with the $w+c$ variant.